This notebook explores the Opération Fourmis public inventory data and summarises it across a few different potential variables of interest.


Overview

The coordinates associated with the public inventory data are probably approximately accurate overall. My concern is that even relatively minor inaccuracies in the coordinates will cause substantial misalignment between the local GIS layers and where the ants were actually collected. Even with ideal location recording methods and conditions, the GPS on devices like smartphones typically has a precision of about ± 5 m. Before assigning habitat or land use categories to the collections, we need to know how reliable that method is, and constrain the questions and analyses accordingly.


Locational precision

GEOPRECISION: Coordinate source

Extracting local variables like habitat, land use, distance from the nearest road, or distance from the nearest building requires a lot of confidence in the latitude and longitude associated with the point locations. The column GEOPRECISION indicates whether the location was extrapolated, corrected, or measured (or some combination).

Summary of geoprecision categorizations.
GEOPRECISION                             Tubes   Percent   Percent (non-NA)
mesuré                                    6159     89.5%              89.6%
extrapolé                                  624      9.1%               9.1%
extrapolé/corrigé                           44      0.6%               0.6%
extrapolé mauvais                           17      0.2%               0.2%
NA                                          12      0.2%                  -
mesuré/corrigé                              11      0.2%               0.2%
extrapolé (base tube précédent)              6      0.1%               0.1%
extrapolé (église par défaut)                5      0.1%               0.1%
extrapolé (gare par défaut)                  4      0.1%               0.1%
extrapolé/corrigé (église par défaut)        1      0.0%               0.0%

The coordinates were mostly measured directly by the collector, and only a small proportion were extrapolated badly. In theory, we could assume that mesuré, extrapolé, extrapolé/corrigé, and mesuré/corrigé indicate that the coordinates can be used directly.

Digits reported in decimal degrees

The number of reported digits gives an estimate of precision for coordinates reported in decimal degrees, but not for the Swiss coordinate system, which reports 6 digits no matter what. An arc-degree of latitude corresponds to about 111 km, as does an arc-degree of longitude at the equator. At the latitude of Vaud (roughly 46 to 47°N), an arc-degree of longitude is about 76.5 km. The precision given for each number of decimals is half of the rounding interval.

Decimals Precision (Lat.) Precision (Lon.)
1 ± 5550 m ± 3825 m
2 ± 555 m ± 383 m
3 ± 55.5 m ± 38.3 m
4 ± 5.55 m ± 3.83 m
5 ± 0.555 m ± 0.383 m
6 ± 0.0555 m ± 0.0383 m
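
The arithmetic behind this table can be sketched as follows. This is a minimal illustration, assuming the two conversion factors quoted in the text (about 111 km per degree of latitude and about 76.5 km per degree of longitude); the function name `rounding_halfwidth_m` is hypothetical and not part of the notebook.

```r
# Half-width (in metres) of the rounding interval for a coordinate reported
# with `decimals` decimal places. `km_per_degree` is the ground distance
# covered by one degree (values below are the approximations from the text).
rounding_halfwidth_m <- function(decimals, km_per_degree) {
  km_per_degree * 1000 * 10^(-decimals) / 2
}

decimals <- 1:6
precision <- data.frame(
  decimals = decimals,
  lat_m = rounding_halfwidth_m(decimals, km_per_degree = 111),
  lon_m = rounding_halfwidth_m(decimals, km_per_degree = 76.5)
)
print(precision)
```

Only the rounding error is captured here; the error of the device itself comes on top of it.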

The reported digits can be used to set a minimum bound on the uncertainty if, e.g., only 2 digits are reported, but devices will typically report many digits even if they are not justified. There were 3945 tubes (57.3%) reporting the coordinates in decimal degrees, with the rest using the 6-digit Swiss coordinates and no estimate of precision. The decimal degree coordinates include 686 tubes with coordinates extrapolated based on the reported locality. The reliability of the extrapolated coordinates for extracting local variables like habitat or land use type relies on a clear description of the habitat by the collector.

Lat/Lon decimal accuracy (joint coarsest).
Decimals Tubes Percent
1 1 0.0%
2 52 1.3%
3 165 4.2%
4 670 17.0%
5 669 17.0%
6 1226 31.1%
7 229 5.8%
8 933 23.7%
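
A minimal sketch of how the joint coarsest number of decimals could be computed is given below. It assumes the coordinates are stored as text (numeric storage drops trailing zeros); the helper `count_decimals` and the example values are hypothetical.

```r
# Count the digits after the decimal point in coordinates stored as text.
# NA coordinates (e.g. tubes recorded in the Swiss system) stay NA.
count_decimals <- function(x) {
  x <- as.character(x)
  has_point <- grepl(".", x, fixed = TRUE)
  frac <- ifelse(has_point, sub("^[^.]*\\.", "", x), "")
  out <- nchar(frac)
  out[is.na(x)] <- NA_integer_
  out
}

# Hypothetical example coordinates
lat <- c("46.5197", "46.52", NA)
lon <- c("6.63227", "6.6323", NA)

# The coarser of the two coordinates limits the precision of the point
joint_decimals <- pmin(count_decimals(lat), count_decimals(lon))
print(joint_decimals)
```

Taking the minimum of the two counts is the conservative choice, since a point is only as precise as its least precise coordinate.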

Typically, smartphones are accurate to within a radius of about 5 m under good conditions, with worse performance around buildings, bridges, trees, etc. It therefore seems likely that coordinates with more than 5 decimal places overstate their precision. More importantly, the 5.5% of locations with fewer than 4 decimal places should not be taken as-is with a high degree of confidence. Again, this metric is not available for the locations recorded with the Swiss coordinate system (2938 tubes: 43%), but it seems reasonable to assume that their distribution of precision is roughly similar.

Filtering

For extracting local conditions based on point locations, it seems reasonable to buffer all points by 5-10 m, with the local habitat or land use type assigned as the dominant category within the buffer. The buffer should not affect distance to nearest road, aside from reducing most distances by a uniform amount and reducing the distance to 0 m for points closer to a road than the buffer radius.

It is also a good idea to remove tubes with GEOPRECISION == "extrapolé mauvais" and possibly "extrapolé (base tube précédent)", "extrapolé (église par défaut)", "extrapolé (gare par défaut)", "extrapolé/corrigé (église par défaut)" as the uncertainty seems likely to be greater than 5-10m. Lastly, tubes with fewer than 3 decimals for the lat/lon coordinates should also be removed for the same reasons.

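library(dplyr)
library(sf)

# GEOPRECISION categories whose uncertainty is likely greater than 5-10 m
geo_exclude <- c("extrapolé mauvais", 
                 "extrapolé (base tube précédent)", 
                 "extrapolé (église par défaut)", 
                 "extrapolé (gare par défaut)", 
                 "extrapolé/corrigé (église par défaut)")
pub_filt <- ant$pub %>% 
  filter(!is.na(GEOPRECISION)) %>%  # drop tubes with no coordinate source
  filter(!GEOPRECISION %in% geo_exclude) %>%
  # Keep Swiss-coordinate tubes (NA lat/lon) and tubes with >= 3 decimals:
  # e.g. "46.123" has 6 characters and "6.123" has 5
  filter(is.na(LATITUDE) | nchar(LATITUDE) > 5) %>% # Swiss coords | Lat decimals
  filter(is.na(LONGITUDE) | nchar(LONGITUDE) > 4) # Swiss coords | Lon decimals
# dist is in CRS units, so this assumes a projected CRS in metres
pub.5m <- pub_filt %>% st_buffer(dist = 5)
pub.10m <- pub_filt %>% st_buffer(dist = 10)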

Habitat with locational uncertainty

Habitat datasets

There are three land cover / land use datasets available:
- Habitat layer created for the structured sampling
- CORINE Land Cover, which has a broader legend and is consistent across Europe, but uses a minimum mapping unit of 25 ha (500 m × 500 m)
- Land use for largely agricultural land in Vaud in 2019

CORINE and the Opération Fourmis dataset have full coverage across Vaud, while the detailed land use dataset is mostly restricted to open canopy areas in the lower elevations (OpFo, CORINE, VD).

[Figure: three map panels: OpFo, CORINE, VD]

Here is a random area within Vaud showing the differences. The grid is 1 km × 1 km, with the public inventory tubes shown as small black points (with 5 m and 10 m buffers), along with building footprints from OpenStreetMap (OpFo, CORINE, VD).
[Figure: three map panels: OpFo, CORINE, VD]

Zoomed in on the edge of a town (OpFo, CORINE, VD):
[Figure: three map panels: OpFo, CORINE, VD]

Zoomed in close on a couple of points (OpFo, CORINE, VD):
[Figure: three map panels: OpFo, CORINE, VD]

The land use data from the canton describes the reported usage in 2019, generally for open canopy habitats. The usage is quite detailed, with a focus on agriculture. Organic vs. conventional methods are not included. It uses 3-digit codes where the first digit indicates (roughly) 500: crops, 600: pastures, 700: permanent agriculture, 800: other?, and 900: other?. This table shows the categories with corresponding area and percent of the total.

CORINE is often used in the literature, but it is probably not very helpful for us: because our extent is limited to Vaud, we have access to datasets that are both thematically and spatially more accurate and precise. It could still be useful for broadly defining whether samples were in cities or towns, since the legend includes Continuous and Discontinuous urban fabric (in the center column above, purple = discontinuous urban fabric). For the other datasets, the main concern is whether the point locations are precise enough to align reliably with the layer polygons.

Habitat extraction: Points and buffers

Using the same habitat categories as the structured samples (first column of plots above), we can calculate the habitat for each tube as the point location, the dominant habitat within 5m, and the dominant habitat within 10m.

The point location ignores uncertainty entirely, while the buffers account for it at a cost. Larger buffers will include more habitat categories, and samples collected along roads or edges would most likely be mis-categorized, since those habitat types are unlikely to have the greatest coverage within a 5 m or 10 m radius. Conversely, even slight inaccuracies in the coordinates would result in mis-categorization of these samples based on the point locations. Assigning a habitat to each point with any degree of confidence is not trivial.
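
The dominant-habitat rule can be sketched in base R as follows. This is an illustration only: the overlap areas are made-up values (a 5 m buffer covers about 78.5 m²), and the column names and the function `dominant_habitat` are hypothetical. In practice the areas would presumably come from intersecting the buffers with the habitat polygons.

```r
# Hypothetical overlap areas between each tube's 5 m buffer and the habitat
# polygons it touches (one row per tube x polygon).
overlap <- data.frame(
  tube    = c("t1", "t1", "t2", "t2", "t2"),
  habitat = c("transport", "ForetMixe", "lisiere", "Autre", "ForetFeuillus"),
  area_m2 = c(20, 58.5, 10, 40, 28.5),
  stringsAsFactors = FALSE
)

# Return the habitat with the largest total area within each tube's buffer.
dominant_habitat <- function(overlap) {
  totals <- aggregate(area_m2 ~ tube + habitat, data = overlap, FUN = sum)
  # Largest area first within each tube, then keep the first row per tube
  totals <- totals[order(totals$tube, -totals$area_m2), ]
  totals[!duplicated(totals$tube), c("tube", "habitat")]
}

dominant <- dominant_habitat(overlap)
print(dominant)
```

In this made-up example the tube collected on a road (t1) is assigned to the surrounding forest, and the tube on a forest edge (t2) is assigned to Autre, which is the behaviour described above for narrow habitat types.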

HABITAT: Comparison of reported and extracted OpFo habitats

The public samples included a field for habitat, and 3348 tubes (49.3%) include a free-form entry. However, these were not standardized, and there were 1572 unique responses. They range from extremely precise about where the ant was captured to rather general. Some seem to describe the diameter of the tree where the ant was collected.

Many of the habitats used for the OpFo structured samples are unlikely to have direct matches that would allow for unambiguous categorization. Searches for keywords could give an idea of how well the extracted habitat matches the stated habitat for the (mostly) unambiguous keywords.
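
A keyword search of this kind could look roughly like the sketch below. The helper `matches_keyword`, the example descriptions, and the exclusion pattern are all assumptions for illustration; the patterns actually used in the notebook may differ.

```r
# Flag free-form descriptions that contain an `include` pattern but none of
# the `exclude` patterns. Matching is case-insensitive; NA never matches.
matches_keyword <- function(x, include, exclude = NULL) {
  hit <- grepl(include, x, ignore.case = TRUE)
  if (!is.null(exclude)) {
    hit <- hit & !grepl(exclude, x, ignore.case = TRUE)
  }
  hit
}

# Hypothetical HABITAT entries. Accents are written as \u escapes so the
# example does not depend on the file encoding.
habitat_desc <- c("For\u00eat de h\u00eatres", "lisi\u00e8re de for\u00eat",
                  "jardin", "FORET mixte", NA)

# "." stands in for the accented letter, so both "foret" and "forêt" match.
is_forest <- matches_keyword(habitat_desc,
                             include = "for.t",
                             exclude = "lisi.re|bord|clairi.re")
print(is_forest)
```

Using `.` in place of the accented letter tolerates entries typed with or without accents, at the cost of occasionally matching unrelated words.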

Forest

Forests are generally large habitat polygons, and most HABITAT descriptions including the word forêt should be describing tubes collected in forest habitat. Edges, borders, and clearings can be filtered out to look at a sort of ‘best case’ scenario.

## Descriptions with 'for*t': 412
Habitat extractions for tubes with HABITAT reported as forest.
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 23 38 116 6.9% 11.4% 34.9%
CulturePerm 1 1 1 0.3% 0.3% 0.3%
ForetConifere 59 152 64 17.8% 45.8% 19.3%
ForetFeuillus 44 60 71 13.3% 18.1% 21.4%
ForetMixe 32 36 33 9.6% 10.8% 9.9%
lisiere 18 2 2 5.4% 0.6% 0.6%
pierrier 1 1 1 0.3% 0.3% 0.3%
transport 115 1 NA 34.6% 0.3% NA
zalluviale 7 8 8 2.1% 2.4% 2.4%
ZoneConstruite 30 32 36 9.0% 9.6% 10.8%
NA 2 1 NA 0.6% 0.3% NA

The 10m buffer ends up including a lot more Autre, indicating that those points are near the forest edge even if they aren’t specified as such. The high proportion of point locations classified as transport could reflect that the ants were collected along a road in the forest, or that the coordinates were recorded after returning to the car.

Lisière

The a priori expectation is that the point locations should be somewhat better for narrow habitat types like lisière. I would also expect poor performance across all methods, since inaccuracy in the point location is likely to move the point outside the habitat polygon, and buffers will include more non-target habitat types.

## Descriptions with 'lisi.re': 205
Habitat extractions for tubes with HABITAT reported as lisière
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 80 107 115 39.0% 52.2% 56.1%
ForetConifere 12 20 17 5.9% 9.8% 8.3%
ForetFeuillus 20 14 17 9.8% 6.8% 8.3%
ForetMixe 15 28 21 7.3% 13.7% 10.2%
lisiere 33 1 NA 16.1% 0.5% NA
marais 3 3 3 1.5% 1.5% 1.5%
pierrier 8 8 8 3.9% 3.9% 3.9%
PrairieSeche 1 1 1 0.5% 0.5% 0.5%
transport 20 NA NA 9.8% NA NA
ZoneConstruite 13 22 21 6.3% 10.7% 10.2%
zalluviale NA 1 1 NA 0.5% 0.5%
CulturePerm NA NA 1 NA NA 0.5%

As expected, the point locations capture lisière best, but it is still only 16% of the tubes with lisi.re in the HABITAT description.

Roads

As with lisière, the a priori expectation is that the point locations should be better for transport, but with relatively poor performance across all methods. Many descriptions in HABITAT use the word chemin, but that is probably used more often for trails than for actual roads.

## Descriptions with 'rue' or 'route': 105
Habitat extractions for tubes with HABITAT reported as transport
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 33 48 58 31.4% 45.7% 55.2%
CulturePerm 1 3 3 1.0% 2.9% 2.9%
ForetConifere 2 3 3 1.9% 2.9% 2.9%
ForetMixe 2 9 4 1.9% 8.6% 3.8%
lisiere 5 1 NA 4.8% 1.0% NA
transport 37 NA NA 35.2% NA NA
ZoneConstruite 25 40 35 23.8% 38.1% 33.3%
PrairieSeche NA 1 1 NA 1.0% 1.0%
ForetFeuillus NA NA 1 NA NA 1.0%

More tubes with HABITAT descriptions including rue and route are classified as transport based on point locations, but it is still only about a third. Using the buffers, they are all categorized as the surrounding (non-road) habitats.

Zone Construite

The ZoneConstruite category should also be unambiguous.

## ZC keywords: maison|appartement|étage|balcon|cuisine
## Descriptions with keywords: 123
Habitat extractions for tubes with HABITAT entries containing a ZC keyword
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 4 4 4 3.3% 3.3% 3.3%
CulturePerm 3 4 10 2.4% 3.3% 8.1%
ForetFeuillus 1 1 1 0.8% 0.8% 0.8%
ForetMixe 2 2 2 1.6% 1.6% 1.6%
lisiere 4 NA NA 3.3% NA NA
transport 8 NA NA 6.5% NA NA
ZoneConstruite 101 111 105 82.1% 90.2% 85.4%
ForetConifere NA 1 1 NA 0.8% 0.8%

Generally good correspondence, with perhaps slightly better matching using the 5m buffer: 90.2% instead of 82.1% (points) or 85.4% (10m buffer).

Gardens

HABITAT descriptions and OpFo habitats

Some of the HABITAT descriptions specify that the tubes were collected in gardens. My expectation is that these tubes should be almost entirely categorized as ZoneConstruite, Autre, and CulturePerm.

## Descriptions with 'jardin' or 'potag*': 257
Habitat extractions for tubes with HABITAT entries containing a jardin keyword
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 23 26 27 8.9% 10.1% 10.5%
CulturePerm 12 12 16 4.7% 4.7% 6.2%
ForetConifere 5 7 7 1.9% 2.7% 2.7%
ForetFeuillus 3 3 3 1.2% 1.2% 1.2%
ForetMixe 2 2 2 0.8% 0.8% 0.8%
lisiere 7 NA NA 2.7% NA NA
transport 28 NA NA 10.9% NA NA
ZoneConstruite 177 207 202 68.9% 80.5% 78.6%

The buffers both place about 95% of the tubes in ZoneConstruite, Autre, or CulturePerm, compared with 83% of the point locations, which include a higher percentage of transport.

Gardens in and out of CORINE-defined urban areas

As a first approximation, we could classify points as inside or outside urban areas using the CORINE land cover categories (codes beginning with 1 indicate human-dominated types). To categorize tubes as coming from gardens, there are two options: 1) use the HABITAT descriptions as above, including all tubes with jardin or potag* in the description, or 2) use the 909 Jardin potager category in the Vaud land use dataset. Unfortunately, no tubes are categorized as 909 Jardin potager based on location, regardless of buffering.

Corine: Point
Garden Non-Urban Urban
FALSE 3962 2567
TRUE 66 191
Corine: 5m
Garden Non-Urban Urban
FALSE 3961 2568
TRUE 66 191
Corine: 10m
Garden Non-Urban Urban
FALSE 3955 2574
TRUE 66 191
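
The cross-tabulation can be sketched as below. The data frame, its column names, and its values are hypothetical; the only assumptions carried over from the text are that gardens are identified by keyword and that three-digit CORINE codes beginning with 1 are treated as urban.

```r
# Hypothetical tubes with a free-form description and a CORINE code
tubes <- data.frame(
  habitat_desc = c("jardin potager", "prairie", "potager", NA, "Jardin"),
  corine_code  = c(112, 312, 211, 111, 242),
  stringsAsFactors = FALSE
)

# Garden: description mentions jardin or potag*; NA counts as not a garden
tubes$garden <- grepl("jardin|potag", tubes$habitat_desc, ignore.case = TRUE)

# Urban: first digit of the three-digit CORINE code is 1
tubes$urban <- ifelse(tubes$corine_code %/% 100 == 1, "Urban", "Non-Urban")

garden_by_urban <- table(Garden = tubes$garden, tubes$urban)
print(garden_by_urban)
```

Repeating this with the codes extracted from the 5 m and 10 m buffers would give the other two tables.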

Too few tubes fall in CORINE class 111 Continuous urban fabric, which covers only parts of the few largest cities in Vaud, for that class to be analysed on its own. Class 112 Discontinuous urban fabric identifies most (but not all) towns, so a comparison of gardens between cities and non-cities would need to be between the urban and non-urban categories.
[Figure: CORINE]

Another possibility would be to categorize communes based on population (that’s the smallest unit I’ve found).

The distribution of population sizes among communes shows no clear breakpoint aside from Lausanne.

Habitat comparison to Vaud

OpFo habitats

Using the habitat types from the structured samples, the public dataset clearly overrepresents ZoneConstruite and underrepresents Autre.

CORINE

Similarly with the CORINE dataset, category 112 Discontinuous urban fabric is very overrepresented, with clear underrepresentation for 211 Non-irrigated arable land and 312 Coniferous forest.

For reference:

Detailed land use

For the land use, many crops are underrepresented, while pastures tend to be overrepresented. This is not really surprising given where people would be expected to go to collect ants.


Proximity to human structures

For each tube, we can also use the location to calculate the distance to the nearest road and/or building, and potentially the type of road. The road type could be interesting, since the OpenStreetMap dataset distinguishes everything from paths to highways.

Roads

There are 25 different identified classes of roads or paths.

Here are maps for each different type of road, reducing them to only the top 15 most extensive categories (total length ≥ 97km).

Buffering points should have minimal influence on distance to the nearest road, since the distance would be reduced by the buffer radius uniformly. The exception would be points nearer to a road than the buffer radius, which would all have a distance of 0m. The type of road nearest to the coordinates should be similarly (mostly) unaffected, though it is possible that a buffer could intersect multiple types of roads. For now, let’s ignore that and just use the point locations.
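
The effect of a buffer on the distance to the nearest road can be written down directly. In this small sketch the function name `buffered_distance` is hypothetical and the example distances are arbitrary.

```r
# Distance from the edge of a circular buffer to the nearest road: the point
# distance minus the buffer radius, floored at 0 when the buffer reaches the road.
buffered_distance <- function(dist_m, radius_m) {
  pmax(dist_m - radius_m, 0)
}

dist_point <- c(0, 3.2, 5.173, 13.663, 28.864)  # arbitrary example distances (m)
dist_5m  <- buffered_distance(dist_point, radius_m = 5)
dist_10m <- buffered_distance(dist_point, radius_m = 10)
print(cbind(dist_point, dist_5m, dist_10m))
```

The ranking of tubes by distance is preserved except among those that collapse to 0 m, which is why the point locations alone are sufficient here.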

Many samples were collected near paths, residential roads, and service roads.

## Summary of distances to the nearest road (m):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.173  13.663  27.592  28.864 520.686
## 
## Further than 200m: 79 tubes 
## Further than 400m: 11 tubes

As should be expected, most points are quite close to a road or path. A small number are quite far from any trail in the dataset.

Buildings

The building dataset doesn’t include any usages or descriptions, but consists of building footprints across the whole canton.